import glob
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data_files = sorted(glob.glob("../data/*"))
data_files
['../data/Folsom_NAM_lat38.579454_lon-121.260320.csv',
 '../data/Folsom_NAM_lat38.599891_lon-121.126680.csv',
 '../data/Folsom_NAM_lat38.683880_lon-121.286556.csv',
 '../data/Folsom_NAM_lat38.704328_lon-121.152788.csv',
 '../data/Folsom_irradiance.csv',
 '../data/Folsom_satellite.csv',
 '../data/Folsom_sky_image_features.csv',
 '../data/Folsom_weather.csv',
 '../data/Irradiance_features_day-ahead.csv',
 '../data/Irradiance_features_intra-day.csv',
 '../data/Irradiance_features_intra-hour.csv',
 '../data/NAM_nearest_node_day-ahead.csv',
 '../data/Sat_image_features_intra-day.csv',
 '../data/Sky_image_features_intra-hour.csv',
 '../data/Target_day-ahead.csv',
 '../data/Target_intra-day.csv',
 '../data/Target_intra-hour.csv']
target = pd.read_csv("../data/Folsom_irradiance.csv")
target.describe().apply(lambda s: s.apply("{0:.5f}".format))
| | ghi | dni | dhi |
|---|---|---|---|
| count | 1552320.00000 | 1552320.00000 | 1551702.00000 |
| mean | 208.54445 | 259.86476 | 54.26736 |
| std | 295.33004 | 363.15687 | 84.32237 |
| min | 0.00000 | 0.00000 | 0.00000 |
| 25% | 0.00000 | 0.00000 | 0.00000 |
| 50% | 3.23000 | 0.00000 | 3.79550 |
| 75% | 386.90000 | 662.80000 | 79.56000 |
| max | 1466.00000 | 1004.00000 | 748.10000 |
target.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1552320 entries, 0 to 1552319
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   timeStamp  1552320 non-null  object
 1   ghi        1552320 non-null  float64
 2   dni        1552320 non-null  float64
 3   dhi        1551702 non-null  float64
dtypes: float64(3), object(1)
memory usage: 47.4+ MB
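The `info()` output shows 1551702 non-null `dhi` values against 1552320 rows, i.e. 618 missing entries, while `ghi` and `dni` are complete. A minimal sketch of how the missing counts could be tabulated per column follows; the `toy` frame and its values are made up for illustration and only mimic the column names of the real table.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the irradiance table (hypothetical values).
toy = pd.DataFrame(
    {
        "ghi": [0.0, 120.5, 300.2, 410.0],
        "dni": [0.0, 200.1, 550.0, 640.3],
        "dhi": [0.0, np.nan, 60.4, 75.1],
    }
)

# Count of missing values per column, plus their share of all rows.
missing = toy.isna().sum().to_frame("n_missing")
missing["share"] = missing["n_missing"] / len(toy)
print(missing)
```

Running the same two lines on `target` instead of `toy` should reproduce the gap seen in the `info()` output.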
target.head()
| | timeStamp | ghi | dni | dhi |
|---|---|---|---|---|
| 0 | 2014-01-02 08:00:00 | 0.0 | 0.0 | 0.0 |
| 1 | 2014-01-02 08:01:00 | 0.0 | 0.0 | 0.0 |
| 2 | 2014-01-02 08:02:00 | 0.0 | 0.0 | 0.0 |
| 3 | 2014-01-02 08:03:00 | 0.0 | 0.0 | 0.0 |
| 4 | 2014-01-02 08:04:00 | 0.0 | 0.0 | 0.0 |
The term solar irradiance represents the power from the sun that reaches a surface per unit area. Direct irradiance is the part of the solar irradiance that directly reaches a surface; diffuse irradiance is the part that is scattered by the atmosphere; global irradiance is the sum of both diffuse and direct components reaching the same surface.
Solar irradiation, by contrast, is the total energy per unit area received from the sun over a specific period of time. The Global Solar Atlas provides three quantities related to solar irradiation:
GHI, Global Horizontal Irradiation
DNI, Direct Normal Irradiation
DIF, Diffuse Horizontal Irradiation
GHI and DIF are measured on a surface horizontal to the ground, while DNI is measured on a surface perpendicular to the sun's rays. A higher DIF/GHI ratio indicates more frequent cloud cover, higher atmospheric pollution, or higher water vapor content. In the table loaded above, the `dhi` column plays the role of DIF.
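To make the DIF/GHI ratio concrete, here is a small sketch that computes a diffuse fraction per row. It assumes the `dhi` column corresponds to DIF, uses a made-up `toy` frame rather than the real data, and masks night-time rows (`ghi == 0`) so the ratio is not evaluated as 0/0.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical): night, overcast, partly cloudy, clear sky.
toy = pd.DataFrame(
    {
        "ghi": [0.0, 100.0, 500.0, 800.0],
        "dhi": [0.0, 95.0, 250.0, 80.0],
    }
)

# Diffuse fraction DIF/GHI; rows without sunlight are set to NaN.
daylight = toy["ghi"] > 0
toy["diffuse_fraction"] = (toy["dhi"] / toy["ghi"]).where(daylight)
print(toy)
```

Values near 1 mean almost all light reaching the horizontal surface is scattered, which fits the cloudy case described above; values near 0 point to clear sky.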
type(target.timeStamp[0])
str
target.timeStamp = pd.to_datetime(target.timeStamp)
target.timeStamp.agg(["min", "max"])
min   2014-01-02 08:00:00
max   2016-12-31 07:59:00
Name: timeStamp, dtype: datetime64[ns]
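The range above covers 1094 days, so a complete one-minute grid would hold 1094 × 1440 = 1575360 rows, more than the 1552320 rows present. That suggests gaps in the record, assuming the timestamps are unique and sampled once per minute. A sketch of how such gaps could be located is shown below on a made-up series; `ts`, `gaps`, and `n_missing` are illustrative names.

```python
import pandas as pd

# Toy one-minute series with 08:03 and 08:04 missing (hypothetical values).
ts = pd.Series(
    pd.to_datetime(
        [
            "2014-01-02 08:00:00",
            "2014-01-02 08:01:00",
            "2014-01-02 08:02:00",
            "2014-01-02 08:05:00",
            "2014-01-02 08:06:00",
        ]
    )
)

# Time elapsed since the previous sample; anything above one minute is a gap.
step = ts.diff()
gaps = pd.DataFrame({"gap_start": ts.shift(1), "gap_end": ts, "step": step})
gaps = gaps[step > pd.Timedelta(minutes=1)]

# Compare against the full one-minute grid over the same range.
expected = pd.date_range(ts.min(), ts.max(), freq="min")
n_missing = len(expected) - ts.nunique()

print(gaps)
print("missing minutes:", n_missing)
```

Replacing `ts` with `target.timeStamp` would apply the same check to the real data.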
# The brief said a good model was obtained with images; let's look
# at the features generated from those images.
image_processed = pd.read_csv("../data/Folsom_sky_image_features.csv")
image_processed.head()
| | timestamp | AVG(R) | STD(R) | ENT(R) | AVG(G) | STD(G) | ENT(G) | AVG(B) | STD(B) | ENT(B) | AVG(RB) | STD(RB) | ENT(RB) | AVG(NRB) | STD(NRB) | ENT(NRB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-01-01 00:00:11 | 102.8933 | 45.8083 | 5.6373 | 121.5468 | 39.5426 | 5.6390 | 133.4322 | 30.8409 | 5.4275 | 0.7457 | 0.1647 | 4.8206 | -0.1554 | 0.1033 | 4.2279 |
| 1 | 2014-01-01 00:01:10 | 109.1193 | 44.9516 | 5.6762 | 128.0568 | 38.7453 | 5.6535 | 139.4049 | 30.2546 | 5.4146 | 0.7601 | 0.1547 | 4.7120 | -0.1447 | 0.0953 | 4.0961 |
| 2 | 2014-01-01 00:02:10 | 118.4310 | 44.4158 | 5.6386 | 129.2313 | 39.1756 | 5.6381 | 134.9957 | 30.8004 | 5.4368 | 0.8591 | 0.1453 | 4.7299 | -0.0822 | 0.0821 | 3.9662 |
| 3 | 2014-01-01 00:03:11 | 108.0799 | 46.3934 | 5.6447 | 129.5778 | 39.5050 | 5.6392 | 142.6288 | 30.9485 | 5.4369 | 0.7347 | 0.1625 | 4.8007 | -0.1626 | 0.1034 | 4.2324 |
| 4 | 2014-01-01 00:04:11 | 106.7813 | 45.5549 | 5.6539 | 126.3533 | 39.0664 | 5.6438 | 137.2983 | 30.4596 | 5.4168 | 0.7541 | 0.1613 | 4.7906 | -0.1494 | 0.1003 | 4.1823 |
image_processed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 775916 entries, 0 to 775915
Data columns (total 16 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   timestamp  775916 non-null  object
 1   AVG(R)     775916 non-null  float64
 2   STD(R)     775916 non-null  float64
 3   ENT(R)     775916 non-null  float64
 4   AVG(G)     775916 non-null  float64
 5   STD(G)     775916 non-null  float64
 6   ENT(G)     775916 non-null  float64
 7   AVG(B)     775916 non-null  float64
 8   STD(B)     775916 non-null  float64
 9   ENT(B)     775916 non-null  float64
 10  AVG(RB)    775916 non-null  float64
 11  STD(RB)    775916 non-null  float64
 12  ENT(RB)    775916 non-null  float64
 13  AVG(NRB)   775916 non-null  float64
 14  STD(NRB)   775916 non-null  float64
 15  ENT(NRB)   775916 non-null  float64
dtypes: float64(15), object(1)
memory usage: 94.7+ MB
image_processed.describe()
| | AVG(R) | STD(R) | ENT(R) | AVG(G) | STD(G) | ENT(G) | AVG(B) | STD(B) | ENT(B) | AVG(RB) | STD(RB) | ENT(RB) | AVG(NRB) | STD(NRB) | ENT(NRB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 | 775916.000000 |
| mean | 137.707012 | 47.618662 | 5.802955 | 141.824441 | 43.323574 | 5.692493 | 147.171596 | 36.249612 | 5.476716 | 0.919222 | 0.138455 | 4.329280 | -0.049773 | 0.073537 | 3.532675 |
| std | 12.999919 | 6.828796 | 0.214288 | 10.191951 | 6.213200 | 0.206817 | 9.908800 | 5.722567 | 0.223504 | 0.081037 | 0.057342 | 0.541403 | 0.050370 | 0.030394 | 0.562946 |
| min | 64.814600 | 17.987500 | 3.820400 | 80.354100 | 15.836900 | 3.712900 | 89.490200 | 14.130800 | 3.638700 | 0.425400 | 0.000000 | -0.000000 | -0.462900 | 0.000000 | -0.000000 |
| 25% | 129.155500 | 43.600375 | 5.652300 | 134.654700 | 39.400200 | 5.553200 | 140.709000 | 32.618000 | 5.338100 | 0.883300 | 0.107700 | 4.073000 | -0.069500 | 0.053200 | 3.192700 |
| 50% | 139.574750 | 47.185800 | 5.765400 | 142.773000 | 43.156250 | 5.644700 | 147.631950 | 36.227600 | 5.449800 | 0.941000 | 0.134500 | 4.459100 | -0.034400 | 0.067800 | 3.595200 |
| 75% | 146.688500 | 51.074500 | 5.936000 | 148.652700 | 46.858300 | 5.811100 | 153.671400 | 39.853400 | 5.592000 | 0.971200 | 0.160300 | 4.703400 | -0.017300 | 0.086500 | 3.932300 |
| max | 189.981700 | 86.451500 | 6.473000 | 189.981700 | 77.633900 | 6.470500 | 200.078600 | 83.527200 | 6.439500 | 2.166300 | 5.016400 | 6.182500 | 0.226300 | 0.323900 | 5.697500 |
# Let's check the variables for correlation.
# numeric_only=True excludes the string timestamp column explicitly,
# which avoids the pandas FutureWarning about the changing default.
corr_image_processed = image_processed.corr(numeric_only=True).abs()
corr_image_processed[corr_image_processed > 0.5][
    corr_image_processed < 1
].style.background_gradient(cmap="turbo")
| | AVG(R) | STD(R) | ENT(R) | AVG(G) | STD(G) | ENT(G) | AVG(B) | STD(B) | ENT(B) | AVG(RB) | STD(RB) | ENT(RB) | AVG(NRB) | STD(NRB) | ENT(NRB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AVG(R) | nan | 0.599443 | nan | 0.917145 | nan | nan | 0.534310 | nan | nan | 0.751218 | nan | 0.676228 | 0.758439 | 0.723061 | 0.764767 |
| STD(R) | 0.599443 | nan | 0.552332 | 0.610294 | 0.954014 | 0.559567 | nan | 0.735437 | nan | nan | nan | nan | nan | 0.685824 | 0.528716 |
| ENT(R) | nan | 0.552332 | nan | nan | 0.571528 | 0.933143 | nan | 0.556546 | 0.719275 | nan | nan | nan | nan | nan | nan |
| AVG(G) | 0.917145 | 0.610294 | nan | nan | 0.562515 | nan | 0.788749 | nan | nan | nan | nan | 0.621855 | nan | 0.562438 | 0.653129 |
| STD(G) | nan | 0.954014 | 0.571528 | 0.562515 | nan | 0.618097 | nan | 0.886050 | 0.557194 | nan | nan | nan | nan | nan | nan |
| ENT(G) | nan | 0.559567 | 0.933143 | nan | 0.618097 | nan | nan | 0.638294 | 0.873744 | nan | nan | nan | nan | nan | nan |
| AVG(B) | 0.534310 | nan | nan | 0.788749 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| STD(B) | nan | 0.735437 | 0.556546 | nan | 0.886050 | 0.638294 | nan | nan | 0.728633 | nan | nan | nan | nan | nan | nan |
| ENT(B) | nan | nan | 0.719275 | nan | 0.557194 | 0.873744 | nan | 0.728633 | nan | nan | nan | nan | nan | nan | nan |
| AVG(RB) | 0.751218 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.531862 | 0.995034 | 0.768661 | 0.688226 |
| STD(RB) | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0.635802 | nan | 0.684455 | 0.660575 |
| ENT(RB) | 0.676228 | nan | nan | 0.621855 | nan | nan | nan | nan | nan | 0.531862 | 0.635802 | nan | 0.550362 | 0.760177 | 0.976830 |
| AVG(NRB) | 0.758439 | nan | nan | nan | nan | nan | nan | nan | nan | 0.995034 | nan | 0.550362 | nan | 0.805550 | 0.705588 |
| STD(NRB) | 0.723061 | 0.685824 | nan | 0.562438 | nan | nan | nan | nan | nan | 0.768661 | 0.684455 | 0.760177 | 0.805550 | nan | 0.847389 |
| ENT(NRB) | 0.764767 | 0.528716 | nan | 0.653129 | nan | nan | nan | nan | nan | 0.688226 | 0.660575 | 0.976830 | 0.705588 | 0.847389 | nan |
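The matrix above is symmetric, so every pair appears twice, and the strongest entries (for example `AVG(RB)` vs `AVG(NRB)` at 0.995, `ENT(RB)` vs `ENT(NRB)` at 0.977, `STD(R)` vs `STD(G)` at 0.954) are easier to read as a ranked list. The sketch below shows one way to extract such a list from the upper triangle; the `toy` frame, its column names, and the 0.9 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy features (hypothetical): "a_scaled" is nearly a linear copy of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "a": a,
        "a_scaled": 2.0 * a + rng.normal(scale=0.01, size=500),
        "noise": rng.normal(size=500),
    }
)

corr = toy.corr(numeric_only=True).abs()

# Keep the strict upper triangle: each pair once, diagonal dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> |r| and keep the strongly correlated pairs.
pairs = upper.stack().sort_values(ascending=False)
high = pairs[pairs > 0.9]
print(high)
```

Pairs that survive a high threshold are candidates for dropping one of the two features before modelling, since they carry nearly the same information.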
image_processed.timestamp = pd.to_datetime(image_processed.timestamp)
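The image timestamps carry seconds (`00:00:11`, `00:01:10`, ...) while the irradiance rows sit on whole minutes, so an exact join on the timestamp would match almost nothing. One possible way to line the two tables up is a nearest-timestamp join with `pd.merge_asof`; this is a sketch on made-up frames, and the 30-second tolerance is an assumption rather than something taken from the dataset documentation.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values).
irradiance = pd.DataFrame(
    {
        "timeStamp": pd.to_datetime(
            ["2014-01-02 08:00:00", "2014-01-02 08:01:00", "2014-01-02 08:02:00"]
        ),
        "ghi": [0.0, 1.5, 3.0],
    }
)
images = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2014-01-02 08:00:11", "2014-01-02 08:01:10"]
        ),
        "AVG(R)": [102.9, 109.1],
    }
)

# Both inputs must be sorted on their keys; rows with no image
# within 30 seconds keep NaN in the image columns.
aligned = pd.merge_asof(
    irradiance.sort_values("timeStamp"),
    images.sort_values("timestamp"),
    left_on="timeStamp",
    right_on="timestamp",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=30),
)
print(aligned)
```

For forecasting work, `direction="backward"` may be the safer choice, since it only pairs each irradiance row with an image taken at or before that time.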
from ydata_profiling import ProfileReport
# Let's try ydata-profiling.
profile = ProfileReport(image_processed, title="Image Features Report")
profile